{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "GPT2 Interactive Notebook",
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyPmJp0XvWIt2/crASe5lmyP",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/affjljoo3581/GPT2/blob/master/GPT2_Interactive_Notebook.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BYbOdA3ENjFx",
        "colab_type": "text"
      },
      "source": [
        "# GPT2 Interactive Notebook\n",
        "\n",
        "## Introduction\n",
        "Welcome! In this notebook, you can play your own trained GPT2 model. This notebook is based on [affjljoo3581/GPT2](https://github.com/affjljoo3581/GPT2). You can play GPT2 model which is trained by [affjljoo3581/GPT2](https://github.com/affjljoo3581/GPT2) in this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZimL206uNtNP",
        "colab_type": "text"
      },
      "source": [
        "## Preparation\n",
        "\n",
        "First of all, you need to set *Runtime Type* to **GPU**. Let's check the current GPU device. We recommend to run this notebook on **Telsa T4** or **Tesla P100**."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bbf0NoKhNBJp",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n",
        "\n",
        "import IPython, torch\n",
        "IPython.display.HTML(f'<p style=\"font-size: 12pt\">Current GPU: <b>{torch.cuda.get_device_name()}</b></p>')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tpL5PhnkTjjZ",
        "colab_type": "text"
      },
      "source": [
        "Next, clone GPT2 repository from github. [affjljoo3581/GPT2](https://github.com/affjljoo3581/GPT2) contains not only training, but also text generation and model evaluation."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lv6adDKVk44o",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!rm -rf GPT2\n",
        "!git clone --quiet https://github.com/affjljoo3581/GPT2"
      ],
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KJrTpBsHUJx4",
        "colab_type": "text"
      },
      "source": [
        "Before playing with GPT2, you need to download trained model file and vocabulary file. Moreover, to evaluate the model, an evaluation corpus file is needed. This notebook supports through [Google Cloud Storage](https://cloud.google.com/storage), so upload the required files to your own storage and specify them to the belows."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TD1EYuW2k9eI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Download resources from Google Cloud Storage\n",
        "\n",
        "model = 'gs://my-bucket/my-model' #@param {type:\"string\"}\n",
        "vocab = 'gs://my-bucket/my-vocab' #@param {type:\"string\"}\n",
        "eval_corpus = 'gs://my-bucket/my-eval-corpus' #@param {type:\"string\"}\n",
        "\n",
        "!gcloud auth login\n",
        "!gsutil -q cp $model model.pth\n",
        "!gsutil -q cp $vocab vocab.txt\n",
        "!gsutil -q cp $eval_corpus corpus.txt"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q3k2wnIHWGWz",
        "colab_type": "text"
      },
      "source": [
        "Finally, configure the details of GPT2 model."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "itZ4rm4xQiIY",
        "colab_type": "code",
        "colab": {},
        "cellView": "form"
      },
      "source": [
        "#@title Model Configuration\n",
        "seq_len = 128 #@param {type:\"integer\"}\n",
        "layers = 24 #@param {type:\"integer\"}\n",
        "heads = 16 #@param {type:\"integer\"}\n",
        "dims = 1024 #@param {type:\"integer\"}\n",
        "rate = 4 #@param {type:\"integer\"}"
      ],
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jWbfBe-zYFpK",
        "colab_type": "text"
      },
      "source": [
        "* `seq_len` : maximum sequence length\n",
        "* `layers` : number of transformer layers\n",
        "* `heads` : number of multi-heads in attention layer\n",
        "* `dims` : dimension of representation in each layer\n",
        "* `rate` : increase rate of dimensionality in bottleneck "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W5mpBWPtQVpJ",
        "colab_type": "text"
      },
      "source": [
        "## Generate Sentences!\n",
        "According to [The Curious Case of Neural Text Degeneration](https://arxiv.org/pdf/1904.09751.pdf), ***Top-k Sampling*** — a popular sampling procedure — is problematic for both the presence of flat distributions and of peaked ones. The authors claimed that there is a risk of generating bland or generic text in some contexts with small $k$. Also,  the top-k vocabulary can include inappropriate candidates with large $k$. So they proposed ***Nucleus Sampling***. In nucleus sampling, the candidates consist of top-p tokens, rather than top-k ones. That is, the highest probability tokens whose cumulative probability mass exceeds the pre-chosen threshold $p$ would be selected.\n",
        "\n",
        "In this notebook, *nucleus sampling* will be used in text generation. As mentioned above, the hyperparameter `nucleus_prob` which is the threshold $p$ is required."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "d5SpOEralSZY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Generation Options\n",
        "nucleus_prob = 0.8 #@param {type:\"slider\", min:0, max:1, step:0.01}\n",
        "\n",
        "import IPython\n",
        "display(IPython.display.HTML('''<style> div.output_text pre {\n",
        "    white-space: pre-line; max-width: 1000px; display: inline-block;\n",
        "} </style>'''))\n",
        "\n",
        "!export PYTHONPATH=GPT2/src; python -m gpt2 generate \\\n",
        "        --vocab_path    vocab.txt \\\n",
        "        --model_path    model.pth \\\n",
        "        --seq_len       $seq_len \\\n",
        "        --layers        $layers \\\n",
        "        --heads         $heads \\\n",
        "        --dims          $dims \\\n",
        "        --rate          $rate \\\n",
        "        --nucleus_prob  $nucleus_prob \\\n",
        "        --use_gpu"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V3PcWHy8QZWn",
        "colab_type": "text"
      },
      "source": [
        "## Evaluate Model!\n",
        "After training the model, you may want to evaluate your own model performance. The most popular and objective method to evaluate the model is to calculate metrics with test dataset, which is not used during training. First, let's check the number of sequences in evaluation corpus dataset.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "e2aw8qJpq1nd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import IPython\n",
        "lines = !wc -l corpus.txt | awk '{print $1}'\n",
        "IPython.display.HTML(f'<p style=\"font-size: 12pt\">Total Evaluation Sequences: <b>{lines[0]}</b></p>')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "43NMXzrG4ioG",
        "colab_type": "text"
      },
      "source": [
        "To improve performance, we will use batch in evaluation. That is, the number of total iterations should be `total sequences / batch size`. Usually, larger batch size leads higher efficiency but too large one occurs memory error. So, it is important to decide a proper batch size."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qRcH5pKMngBR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#@title Evaluation Options\n",
        "%%time\n",
        "batch_eval = 256 #@param {type: \"integer\"}\n",
        "total_steps = 100 #@param {type: \"integer\"}\n",
        "\n",
        "!export PYTHONPATH=GPT2/src; python -m gpt2 evaluate \\\n",
        "        --model_path    model.pth \\\n",
        "        --eval_corpus   corpus.txt \\\n",
        "        --vocab_path    vocab.txt \\\n",
        "        --seq_len       $seq_len \\\n",
        "        --layers        $layers \\\n",
        "        --heads         $heads \\\n",
        "        --dims          $dims \\\n",
        "        --rate          $rate \\\n",
        "        --batch_eval    $batch_eval \\\n",
        "        --total_steps   $total_steps \\\n",
        "        --use_gpu"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}